Goto

Collaborating Authors

 video synthesis


VideoComposer Compositional Video Synthesis with Motion

Neural Information Processing Systems

The pursuit of controllability as a higher standard of visual content creation has yielded remarkable progress in customizable image synthesis. However, achieving controllable video synthesis remains challenging due to the large variation of temporal dynamics and the requirement of cross-frame temporal consistency. Based on the paradigm of compositional generation, this work presents VideoComposer that allows users to flexibly compose a video with textual conditions, spatial conditions, and more importantly temporal conditions. Specifically, considering the characteristic of video data, we introduce the motion vector from compressed videos as an explicit control signal to provide guidance regarding temporal dynamics. In addition, we develop a Spatio-Temporal Condition encoder (STCencoder) that serves as a unified interface to effectively incorporate the spatial and temporal relations of sequential inputs, with which the model could make better use of temporal conditions and hence achieve higher inter-frame consistency. Extensive experimental results suggest that VideoComposer is able to control the spatial and temporal patterns simultaneously within a synthesized video in various forms, such as text description, sketch sequence, reference video, or even simply hand-crafted motions. The code and models are publicly available at https://videocomposer.github.io.




Streaming Radiance Fields for 3D Video Synthesis

Neural Information Processing Systems

We present an explicit-grid based method for efficiently reconstructing streaming radiance fields for novel view synthesis of real world dynamic scenes. Instead of training a single model that combines all the frames, we formulate the dynamic modeling problem with an incremental learning paradigm in which per-frame model difference is trained to complement the adaption of a base model on the current frame. By exploiting the simple yet effective tuning strategy with narrow bands, the proposed method realizes a feasible framework for handling video sequences on-the-fly with high training efficiency. The storage overhead induced by using explicit grid representations can be significantly reduced through the use of model difference based compression. We also introduce an efficient strategy to further accelerate model optimization for each frame. Experiments on challenging video sequences demonstrate that our approach is capable of achieving a training speed of 15 seconds per-frame with competitive rendering quality, which attains $1000 \times$ speedup over the state-of-the-art implicit methods.


Plan-X: Instruct Video Generation via Semantic Planning

arXiv.org Artificial Intelligence

Diffusion Transformers have demonstrated remarkable capabilities in visual synthesis, yet they often struggle with high-level semantic reasoning and long-horizon planning. This limitation frequently leads to visual hallucinations and mis-alignments with user instructions, especially in scenarios involving complex scene understanding, human-object interactions, multi-stage actions, and in-context motion reasoning. To address these challenges, we propose Plan-X, a framework that explicitly enforces high-level semantic planning to instruct video generation process. At its core lies a Semantic Planner, a learnable multimodal language model that reasons over the user's intent from both text prompts and visual context, and autoregressively generates a sequence of text-grounded spatio-temporal semantic tokens. These semantic tokens, complementary to high-level text prompt guidance, serve as structured "semantic sketches" over time for the video diffusion model, which has its strength at synthesizing high-fidelity visual details. Plan-X effectively integrates the strength of language models in multimodal in-context reasoning and planning, together with the strength of diffusion models in photorealistic video synthesis. Extensive experiments demonstrate that our framework substantially reduces visual hallucinations and enables fine-grained, instruction-aligned video generation consistent with multimodal context.


Echo-Path: Pathology-Conditioned Echo Video Generation

arXiv.org Artificial Intelligence

Cardiovascular diseases (CVDs) remain the leading cause of mortality globally, and echocardiography is critical for diagnosis of both common and congenital cardiac conditions. However, echocardiographic data for certain pathologies are scarce, hindering the development of robust automated diagnosis models. In this work, we propose Echo-Path, a novel generative framework to produce echocardiogram videos conditioned on specific cardiac pathologies. Echo-Path can synthesize realistic ultrasound video sequences that exhibit targeted abnormalities, focusing here on atrial septal defect (ASD) and pulmonary arterial hypertension (PAH). Our approach introduces a pathology-conditioning mechanism into a state-of-the-art echo video generator, allowing the model to learn and control disease-specific structural and motion patterns in the heart. Quantitative evaluation demonstrates that the synthetic videos achieve low distribution distances, indicating high visual fidelity. Clinically, the generated echoes exhibit plausible pathology markers. Furthermore, classifiers trained on our synthetic data generalize well to real data and, when used to augment real training sets, it improves downstream diagnosis of ASD and PAH by 7% and 8% respectively. Code, weights and dataset are available here.




Semantic Communication-Enabled Cloud-Edge-End-collaborative Metaverse Services Architecure

arXiv.org Artificial Intelligence

With technology advancing and the pursuit of new audiovisual experiences strengthening, the metaverse has gained surging enthusiasm. However, it faces practical hurdles as substantial data like high-resolution virtual scenes must be transmitted between cloud platforms and VR devices. Specifically, the VR device's wireless transmission hampered by insufficient bandwidth, causes speed and delay problems. Meanwhile, poor channel quality leads to data errors and worsens user experience. To solve this, we've proposed the Semantic Communication-Enabled Cloud-Edge-End Collaborative Immersive Metaverse Service (SC-CEE-Meta) Architecture, which includes three modules: VR video semantic transmission, video synthesis, and 3D virtual scene reconstruction. By deploying semantic modules on VR devices and edge servers and sending key semantic info instead of focusing on bit-level reconstruction, it can cut latency, resolve the resource-bandwidth conflict, and better withstand channel interference. Also, the cloud deploys video synthesis and 3D scene reconstruction preprocessing, while edge devices host 3D reconstruction rendering modules, all for immersive services. Verified on Meta Quest Pro, the SC-CEE-Meta can reduce wireless transmission delay by 96.05\% and boost image quality by 43.99\% under poor channel condition.


Multimodal Cinematic Video Synthesis Using Text-to-Image and Audio Generation Models

arXiv.org Artificial Intelligence

Advances in generative artificial intelligence have altered multimedia creation, allowing for automatic cinematic video synthesis from text inputs. This work describes a method for creating 60-second cinematic movies incorporating Stable Diffusion for high-fidelity image synthesis, GPT-2 for narrative structuring, and a hybrid audio pipeline using gTTS and YouTube-sourced music. It uses a five-scene framework, which is augmented by linear frame interpolation, cinematic post-processing (e.g., sharpening), and audio-video synchronization to provide professional-quality results. It was created in a GPU-accelerated Google Colab environment using Python 3.11. It has a dual-mode Gradio interface (Simple and Advanced), which supports resolutions of up to 1024x768 and frame rates of 15-30 FPS. Optimizations such as CUDA memory management and error handling ensure reliability. The experiments demonstrate outstanding visual quality, narrative coherence, and efficiency, furthering text-to-video synthesis for creative, educational, and industrial applications.